Supplementary Material for Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

Neural Information Processing Systems

The masking strategy is set to random and the mask ratio m is 60%.

Embedding: To embed each masked point patch, Point-MAE substitutes it with a learnable, weight-shared mask token. Meanwhile, for unmasked (i.e., visible) point patches, Point-MAE employs a lightweight PointNet [8] to extract features from the point patches. The visible point patches $P_v$ are hence embedded into visible tokens $T_v$:

$$T_v = \mathrm{PointNet}(P_v) \tag{1}$$

Backbone: The backbone of Point-MAE is entirely based on standard Transformers, with an asymmetric encoder-decoder. The encoder takes visible tokens $T_v$ as input to generate encoded tokens $T_e$. In addition, Point-MAE incorporates positional embeddings into each Transformer block, thereby adding location-based information.
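The random masking and token assembly described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `random_patch_mask`, `toy_patch_embedding`, and `build_tokens` are hypothetical, the coordinate-wise max pooling is only a stand-in for the lightweight PointNet, and the fixed `mask_token` tuple stands in for a learnable parameter.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.6, seed=None):
    """Split patch indices into (visible, masked) by uniform random masking."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)

def toy_patch_embedding(patch):
    """Stand-in for PointNet (assumption): coordinate-wise max over the points."""
    return [max(point[d] for point in patch) for d in range(3)]

def build_tokens(patches, mask_ratio=0.6, mask_token=(0.0, 0.0, 0.0), seed=None):
    """Embed visible patches and assign one shared mask token to masked patches."""
    visible, masked = random_patch_mask(len(patches), mask_ratio, seed)
    visible_tokens = {i: toy_patch_embedding(patches[i]) for i in visible}
    # Every masked position refers to the same token object (weight sharing).
    masked_tokens = {i: mask_token for i in masked}
    return visible_tokens, masked_tokens

if __name__ == "__main__":
    rng = random.Random(0)
    # 10 toy patches, each holding 4 points with (x, y, z) coordinates.
    patches = [[(rng.random(), rng.random(), rng.random()) for _ in range(4)]
               for _ in range(10)]
    vis, msk = build_tokens(patches, mask_ratio=0.6, seed=0)
    print(len(vis), "visible patches,", len(msk), "masked patches")
```

In this sketch only the visible tokens would be passed to the encoder, which matches the asymmetric encoder-decoder design described in the text.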


PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation

Neural Information Processing Systems

The BlendedMVS [7] dataset is a large-scale synthetic dataset for multi-view stereo containing 113 scenes, which can be further divided into a large-scale outdoor scenes part and a small-scale objects part according to the scene scale. Since current large-scene NeRF methods train one model per scene, to save computational resources and time we select the first five scenes of the large-scale outdoor scenes part and compare with Mip-NeRF 360 [2], which is the optimal baseline on the representative subset of the OMMO dataset [3] as shown in our manuscript; see Tab. 4 and Figure 1.


Fully Explicit Dynamic Gaussian Splatting

Neural Information Processing Systems

Meanwhile, a promising alternative, 3D Gaussian Splatting (3DGS) [13], has emerged, which achieves photo-realistic rendering results with significantly faster training and rendering speeds.


Multi-modal Situated Reasoning in 3D Scenes

Neural Information Processing Systems

Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling.






Weakly Supervised Dense Event Captioning in Videos

Neural Information Processing Systems

Among the wide variety of applications in video understanding, the video captioning task has attracted increasing interest in recent years [4, 5, 6, 7, 8, 9, 10, 11].


Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme

Neural Information Processing Systems

As shown in Figure 1(b) and 1(c), when compressing the generator, the loss of the discriminator gradually tends to zero. This indicates that the capacity of the discriminator significantly surpasses that of the lightweight generator.